Data Visualisation 2

Ben Whalley

Nothing new under the sun: Mediaeval line plot, circa 1010. Image: wikimedia


If you haven’t already completed the first data vis worksheet, do that now.

Facets

As we add layers our plots become more complex. We run into trade-offs between information density and clarity.

To give one example, this plot shows life expectancies for each country in the development data, plotted by year:

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE)

Explanation: This is another x/y plot. However this time we have not added points, but rather smoothed lines (one for each country).

Explanation of the code:We have created an x/y plot as before, but this time we only added geom_smooth (and not geom_point), so we can’t see the individual datapoints. We have also added the text group=country which means we see one line per-country in the dataset. Finally, we also added se=FALSE. This hides the shaded area which geom_smooth adds by default.

Comment on the result: It’s pretty hard to read!


To increase the information density, and explore patterns within the data, we might add another dimension and aesthetic. The next plot colours the lines by continent:

development %>%
  ggplot(aes(year, life_expectancy, colour=continent, group=country)) +
  geom_smooth(se=FALSE)

However, even with colours added it’s still a bit of a mess. We can’t see the differences between continents easily.

To clean things up we can use a technique called facetting.

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)

Explanation: We added the code + facet_grid(.~continent) to our previous plot, and removed the color aesthetic. This made ggplot create individual panels for each continent. Splitting the graph this way makes it somewhat easier to compare the differences between continents.

Try facetting yourself

Use the iris dataset, which is built into R.

  1. Try to recreate this plot by adapting the code shown above

  2. Create a second plot which uses colours to distinguish species and does not use Facets

  3. In this example, which plot do you prefer? What influences when facets are more useful the just using colour?

iris %>%
  ggplot(aes(Sepal.Length, Petal.Length)) +
  geom_point()  +
  geom_smooth() +
  facet_grid(~Species)
iris %>%
  ggplot(aes(Sepal.Length, Petal.Length, color=Species)) +
  geom_point()  +
  geom_smooth() 

There’s no single right answer here, but for this example I prefer the coloured plot to the faceted one. The reason is that there are only 3 species in this dataset, and the points for each don’t overlap much. This means it is easy to distinguish them, even in the combined plot.

However, if there were many different species it might be helpful to use facets instead.

Our decisions should be driven by what we are trying to communicate with the plot. What was the research question that motivated us to draw it?

Adjusting the facets

In the explanation above, we saw this plot:

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)

Re-run the plot and make the following adaptations:

  1. Try replacing facet_grid(~continent) with facet_grid(continent~.). What happens?

  2. Try replacing facet_grid with facet_wrap(~continent). What happens?

  3. If you have time, to see more faceting examples, see the ggplot cookbook documentation. Try to re-create some of these examples.

Comparing categories

In the examples above we have been plotting continuous variables (and adding colours etc). We’ve used density, scatter and smoothed line plots to do this.

Another common requirement is to use plots to compare summary statistics for different groups or categories. For example, the classic plot in a psychology study looks like this:

However, there is evidence that readers often misinterpret bar plots. Specifically, the problem is that we perceive values within the bar area as more likely than those just above, even though this is not in fact the case.

A better choice is (almost always) to use a boxplot:

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_boxplot()

Explanation: We used Condition, a category, as our x axis, and reaction times as the y axis. We added geom_boxplot to show a boxplot.

If you’re not familiar with boxplots, there are more details in the help files (type ?geom_boxplot into the console) or use the wikipedia page here

Loading data over the internet

So far we have always used data from the psydata package. However, we can also load data from files or over the internet using the read_csv() function. This is an example:

expdata <- read_csv('https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv')

expdata %>% glimpse
Rows: 240
Columns: 4
$ Condition <chr> "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", …
$ stimuli   <chr> "S1", "S1", "S1", "S2", "S2", "S2", "S3", "S3", "S3", "S4", "S4", "S4", "S1", "S…
$ p         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,…
$ RT        <dbl> 151.84776, 313.68298, 556.61563, 53.63154, 560.36673, 586.03610, 504.98535, 271.…

Explanation: The part which starts expdata <- read_csv('https://... loads a CSV (data) file over the internet, and stores it in the variable called expdata. You can click this link to see the raw data. We then use glimpse to check the data

Load the (simulated) dataset from this url: https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv (note that this URL is very long — make sure you copy the whole of it).

  • Recreate the boxplot above
  • Use a facet to recreate the plot you saw above, combining both Condition and Stimuli

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_boxplot() +
  facet_wrap(~Condition)

Other data summaries

If you really need to plot the mean and standard error or standard deviations of different categories, ggplot also has the stat_summary command:

expdata %>%
  ggplot(aes(Condition, RT)) + 
  stat_summary()

Explanation: We used Condition and RT as our x and y axes, as before. This time we added stat_summary() instead of geom_boxplot(). By default this plots the mean and standard error (a measure of variability) in each group, using a point-range plot. This is better than a bar chart because it avoids a known bias in how we read them. You can ignore the warning about No summary function supplied, defaulting to mean_se() for now.

  • Adapt your facetted boxplot from above to show the mean and standard error instead

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  stat_summary() +
  facet_wrap(~Condition)

Multiple summary layers

A “violin plot” is a cross between a boxplot and a density plot.

They are simple ot make in ggplot; we can simply replace geom_boxplot() with geom_violin():

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_violin() +
  facet_wrap(~Condition)

However many authors find it helpful to overlay additional summaries like the mean and standard error of a variable. For example:

Make a new plot, based on the code above, which combines geom_violin() with stat_summary()

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_violin() +
  stat_summary() +
  facet_wrap(~Condition)

Decisions about grouping

Consider the following two plots. Both present the same data on the incomes of men and women who are either blind or normally sighted:

  1. Which plot do you prefer, and why?

I don’t have a strong overall preference between the plots, but I think plot A is probabably more useful. My reasoning is that in this plot we can see that:

  1. Men are paid more
  2. Blind people are paid less
  3. The gender-pay gap appears to be larger for blind people

I think that third point is slightly harder to read/understand from plot B.

The idea to take-away is that grouping the data in the plot in different ways can make it easier or more difficult to identify different patterns.

  1. Re-organize the plot below to make it easier to see which level of education has the highest gender pay gap.
earnings  %>%
  ggplot(aes(x=education, y=income/1000)) + 
  facet_grid(~gender) + 
  stat_summary() +
  labs(y = "Income (£ 1000)", x="Gender")

To reorganise the panels, you will need to amend the variables used to facet the plot.

earnings  %>%
  ggplot(aes(x=gender, y=income/1000)) + 
  facet_grid(~education) + 
  stat_summary() +
  labs(y = "Income (£ 1000)", x="Gender")

  1. Try adding scale_y_log10() to your updated plot. How does this change the impression you have of the data? Which plot is the most “truthful”?

By adding a log scale to the gender pay gaps appear to be more equal across different education categories.

earnings  %>%
  ggplot(aes(x=gender, y=income/1000)) + 
  facet_grid(~education) + 
  stat_summary() +
  scale_y_log10() +
  labs(y = "Income (£ 1000)", x="Gender")

Either plot can be considered more or less ‘truthful’, depending on your perspective:

  • If you believe a £2000 pay gap is the same for people earning around £20,000 as it is for those earning £100,000 then the first plot is more helpful. We can see there that largest pay gaps are among the most educated and so highest-earning.

  • If we consider that £2000 means less to those earning very high incomes, then the second plot is more helpful. This plot shows that, relative to total income, the size of the pay gap is pretty consistent across education categories.

However the elephant in the room is that income has a very skewed distribution:

earnings %>% 
  ggplot(aes(income)) + geom_density()

This means that medians for men/women are very different to the means, and the median gender pay gap is different to the mean gender pay gap:

earnings  %>%
  group_by(blind, gender) %>% 
  summarise(m=mean(income)) %>% 
  group_by(blind) %>% 
  summarise(diff(m)) %>% 
  set_names(c("Vision", "Mean pay gap (£)")) %>% 
  pander::pander()
Vision Mean pay gap (£)
Blind 20370
Normally sighted 8473
earnings  %>%
  group_by(blind, gender) %>% 
  summarise(m=median(income)) %>% 
  group_by(blind) %>% 
  summarise(diff(m)) %>% 
  set_names(c("Vision", "Median pay gap (£)")) %>% 
  pander::pander()
Vision Median pay gap (£)
Blind 1735
Normally sighted 8222

The moral here is that different summaries of the data tell different stories: We should explore our data thoroughly, and avoid jumping to conclusions!

More practice

Scenario 1: Secret agent

You are an MI6 agent, and have been sent a mystery dataset by one of your spies. She said it contains highly important information which will be of great interest to your superiors.

  1. Use your ggplot wizardry (and techniques we have covered today) to recover this classified information and present it in a single graphic for Q.

You can load the data using a new command, read_csv(). This command accepts either the name of a file, or a URL that the data is stored on the web.

The file URL is: https://gist.githubusercontent.com/benwhalley/2103941f0c2ac0449247f59fe9c976b6/raw/4630367728610f0a9937d71eeac6fb511065ee46/mystery.csv

To load the data, you might write:

mystery <- read_csv('https://gist.githubusercontent.com/benwhalley/2103941f0c2ac0449247f59fe9c976b6/raw/4630367728610f0a9937d71eeac6fb511065ee46/mystery.csv')

mystery %>% glimpse
Rows: 1,846
Columns: 3
$ category <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",…
$ x        <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410, 33.0769, 28.9744…
$ y        <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718, 77.5641, 74.4872…

Scenario 2: Admissions tutor

In the 1970s the University of California, Berkley, was concerned about the fairness of their admissions procedures. They collected data from across the university for a number of years, recording the:

  • Number of applicants
  • The department the student applied to
  • The students’ gender
  • Number of students accepted
  • The percentage students of each gender who were accepted in each department

A summary of these data are available at this link: Berkley applications data

Your job is to:

  • Describe the pattern of applications
  • Decide if the university was fair in it’s admissions procedures
  • Prepare a short presentation for the university governors which includes plots

Techniques/commands you might want to use:

  • filter
  • group_by and summarise
  • stat_summary
  • facet_wrap() or facet_grid()
berkley <- read_csv('https://raw.githubusercontent.com/benwhalley/datafluency/master/website/data/berkley.csv')
berkley %>% glimpse
Rows: 72
Columns: 6
$ year             <dbl> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976,…
$ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Female", "Male", "Ma…
$ department       <chr> "A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F", "A", "B", "…
$ n.applicants     <dbl> 133, 34, 785, 462, 511, 444, 1095, 740, 416, 536, 234, 485, 145, 33, 76…
$ n.admitted       <dbl> 107, 20, 314, 146, 149, 318, 725, 455, 141, 205, 62, 297, 108, 21, 256,…
$ percent.admitted <dbl> 80.45, 58.82, 40.00, 31.60, 29.16, 71.62, 66.21, 61.49, 33.89, 38.25, 2…

This plot suggests a paradox: Although more men applied and were admitted, the proportion of male applicants who were successful was actually lower than for women:

Breaking the data down by-department reveals what is happening:

Note that you don’t (yet) know how to produce the plot above. We will cover this in the next sessions, but for now you could make two individual plots, like so:

This is an example of Simpson’s paradox. You can read more about the case here: http://corysimon.github.io/articles/simpsons-paradox/

Check your knowledge

  1. What is a facet?
  2. When might we use facets in our work?
  3. What is the difference between facet_wrap and facet_grid?
  4. What statistics does a boxplot show?
  5. What type of plot does stat_summary() make by default?
  6. What is the default statistic presented by stat_summary?
  7. How can we decide between different types of plots when visualising data?